RE: [-empyre-] archiving

"Paul Koerbin" <pkoerbin@nla.gov.au> · Fri, 4 Feb 2005 09:44:38 +1100

Adam asks the fair question: why do we have PANDORA when there is the
Internet Archive? On the face of it what the IA does would seem to solve
all the problems; however, this is not the case. Rather the IA and
PANDORA are complementary.

The IA does continual crawling of the Internet trying to capture as much
as it can - but it cannot and does not capture everything. For example,
if the robots.txt rules deny access to indexing/harvesting robots the IA
does not copy the files. In PANDORA we seek permission from the
publisher prior to archiving so we are able to rightfully ignore the
robots.txt rules and capture content the IA will not. There is no
guarantee for us, for Australians, that the IA will capture what we
might consider important; and if it does happen to pick it up we could
have no expectation that it has been checked for completeness and
functionality.
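
As a rough illustration of the robots.txt point above, here is a minimal
Python sketch using the standard library's urllib.robotparser. The
function name, the "example-harvester" agent string, the example.org
URLs and the has_permission flag are all assumptions made for
illustration; this is not how either the IA's or PANDORA's harvester is
actually implemented.

```python
from urllib.robotparser import RobotFileParser


def should_harvest(url, robots_lines, agent="example-harvester",
                   has_permission=False):
    """Decide whether a (hypothetical) harvester may copy `url`.

    robots_lines: the lines of the site's robots.txt file.
    has_permission: True when the publisher has agreed to archiving,
    in which case the robots.txt rules are deliberately ignored.
    """
    if has_permission:
        return True
    parser = RobotFileParser()
    parser.parse(robots_lines)  # parse rules supplied as a list of lines
    return parser.can_fetch(agent, url)


# Example rules that exclude every robot from /private/
rules = ["User-agent: *", "Disallow: /private/"]
```

With these rules a purely automated crawl would skip
http://example.org/private/a.html, while a harvest made with the
publisher's permission would still collect it.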

Because PANDORA is selective we are able to be highly reactive to events
on the net. If we spot something that is currently of note on the
Internet, it can be archived in a timely manner. If you are doing
continuous trawls of the net you just have to hope that you will pick
things up in a timely manner, but you may not. As an example, we just
archived the Australian Labor Party web site to reflect the leadership
changes:
http://nla.gov.au/nla.arc-22093

More than half the titles in PANDORA are archived in an ongoing,
scheduled manner, so I suspect Adam just didn't try enough titles. The
frequency of archiving is an informed decision made by the curators (web
archivers). If you do continuous crawling of the net like the IA you may
archive the same document many times despite the fact that it will not
change over time. We determine a gather schedule that aims to gather all
the content over time but not necessarily all the changes that occur on
a site. If the online resource is static we only archive it once. Or if
changes are minimal we may not archive it regularly, etc. Since this work
is labour intensive we have to be pragmatic about this. But, have a look
at our archiving of Crikey http://nla.gov.au/nla.arc-13027 to see just
one example of a title with multiple archivings. At the risk of
labouring a point, take a look at the harvest in PANDORA for 19 June 2003
with an infamous photograph!
http://pandora.nla.gov.au/pan/13027/20030619/www.crikey.com.au/index.html
Now try the IA:
http://web.archive.org/web/*/http://www.crikey.com.au
The IA did not harvest at that particular time. The photo is not there.
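
The archive URLs quoted in this message appear to carry the harvest date
in the path (an 8-digit date after the title number for PANDORA, a
14-digit timestamp for the IA). Assuming that pattern holds, which is
inferred only from the URLs shown here, a small Python sketch can pull
the date out so that harvests from the two archives can be compared. The
function name and regular expressions are illustrative assumptions.

```python
import re
from datetime import date

# Patterns inferred from the URLs quoted in this message (an assumption).
PANDORA_RE = re.compile(r"^http://pandora\.nla\.gov\.au/pan/(\d+)/(\d{8})/(.+)$")
WAYBACK_RE = re.compile(r"^http://web\.archive\.org/web/(\d{14})/(.+)$")


def harvest_date(url):
    """Return the harvest date encoded in an archive URL, or None."""
    match = PANDORA_RE.match(url)
    if match:
        stamp = match.group(2)
    else:
        match = WAYBACK_RE.match(url)
        if not match:
            return None  # e.g. a wildcard listing URL with no timestamp
        stamp = match.group(1)[:8]  # keep YYYYMMDD, drop the time of day
    return date(int(stamp[:4]), int(stamp[4:6]), int(stamp[6:8]))
```

For the Crikey harvest above this yields 19 June 2003; the wildcard IA
listing URL has no single timestamp, so the function returns None for it.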

Because the IA relies only on automated harvesting - well they have to
given the scale they are working on - they will miss a lot of content
that can with a bit of extra tweaking be gathered or made functional.
Take, for example, Komninos's own site. Here is an IA version of the
site:
http://web.archive.org/web/20020213075239/http://www.gu.edu.au/ppages/K_Zervos/index.html
Note all the missing images. We, when archiving for PANDORA, would
consider the images an important part of the site so we would check the
site at the time of harvest (which is the only practical time to do it
if any fixes need to be made) to ensure we have gathered as much content
as possible and made the archived copy as functional as possible.
Compare with a version of the same site in PANDORA archived around the
same time:
http://pandora.nla.gov.au/pan/10267/20020306/www.gu.edu.au/ppages/K_Zervos/index.html

Take another couple of examples: RealMedia files are commonly referenced
with .ram metafiles which point to the actual media file, usually
delivered from a RealMedia server. Automated harvesting will gather the
.ram file but it will not (currently at least) be able to gather the .rm
or .ra file from the RealMedia server and re-reference the .ram file to
point to an archived copy of the .ra or .rm file. In PANDORA we have to
contact the publisher to get them to supply the media files and we
re-reference the .ram files to point to the archived version. This is
archiving; and this sort of work has to be done at the time. Secondly,
we have a number of subscription titles in PANDORA whereby we negotiate
with the publisher either to supply the files or provide us with logins
and passwords to harvest them. Here is an IA version for the
subscription title the Justinian. Try to look at the articles.
http://web.archive.org/web/20031205234822/http://www.justinian.com.au/
Compare with a PANDORA version archived around the same time:
http://pandora.nla.gov.au/pan/10397/20031112/www.justinian.com.au/index.html
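
To make the .ram re-referencing described above more concrete, here is a
minimal Python sketch. A .ram metafile is a plain-text list of media
URLs, one per line; the sketch rewrites any line pointing at a .rm or
.ra file so that it points at an archived copy instead. The function
name, the example.org URLs and the flat archive_base layout are
assumptions for illustration only, not a description of PANDORA's actual
tooling.

```python
import posixpath
from urllib.parse import urlsplit


def rereference_ram(ram_text, archive_base):
    """Point the media URLs in a .ram metafile at archived copies.

    archive_base is a hypothetical URL under which the media files
    supplied by the publisher have been stored, keeping their names.
    """
    rewritten = []
    for line in ram_text.splitlines():
        parts = urlsplit(line.strip())
        is_media = parts.path.lower().endswith((".rm", ".ra"))
        if parts.scheme in ("rtsp", "pnm", "http") and is_media:
            filename = posixpath.basename(parts.path)
            rewritten.append(archive_base.rstrip("/") + "/" + filename)
        else:
            rewritten.append(line)  # leave anything else untouched
    return "\n".join(rewritten)
```

For example, a metafile line rtsp://media.example.org/audio/talk.rm with
an archive_base of http://archive.example.org/media would become
http://archive.example.org/media/talk.rm. Obtaining the media file itself
still depends on the publisher supplying it, as noted above.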

I hope this demonstrates that in many respects the IA and PANDORA do not
duplicate each other. The aim of the IA is to harvest as much as
possible while accepting that it will have many lacunae. It also has to
be careful legally so it does not archive sites with robots.txt
exclusions and will remove titles from the archive if requested by the
owners. PANDORA's mission is somewhat different. PANDORA aims to
contribute to the documentation of Australia's heritage (such a mission
would only be a very minor part of the IA's ambition). Hence we archive
in a timely fashion, seek permission, obtain missing content, and do
quality checking to ensure the harvests are the best we can technically
do. The
scope of PANDORA is smaller than the IA, but the IA complements PANDORA
by (hopefully) archiving Australian sites that fall outside the scope of
our current collecting priorities. 

One final point to make: the National Library has a statutory
responsibility to collect, preserve, describe and make accessible
Australia's documentary heritage. It would hardly be meeting this
requirement if we relied solely on a non-profit organisation based in
the USA whose interest in the Australian domain is only peripheral.

Paul

-----Original Message-----
From: empyre-bounces@gamera.cofa.unsw.edu.au
[mailto:empyre-bounces@gamera.cofa.unsw.edu.au] On Behalf Of adam
Sent: Thursday, 3 February 2005 1:50 PM
To: Komninos Zervos
Cc: soft_skinned_space
Subject: [-empyre-] archiving 

hi,

I'm also curious about this topic. I guess one thing that puzzles me is
why there is an archive like PANDORA when the same material is archived
by "The Internet Archive" (http://www.archive.org).

When I visit PANDORA I can get access to a list of sites available but
(and I only had a quick look so apologies if I have this wrong) only one
version of each site is available. Some sites are archived more than
once apparently, but I am unable to find these extra versions on the
site.

When I look at the archive.org site I see many many versions of sites.

For example if I look at the "Artspace" archive of PANDORA, I see one
URL:
http://pandora.nla.gov.au/tep/36417

which leads to this version of the site, archived July 2003:
http://pandora.nla.gov.au/pan/36417/20030708/www.artspace.org.au/index.html

In fact the PANDORA site states this is the only archived version of the
Artspace site.

Whereas if I look at archive.org for the same site I get this page:
http://web.archive.org/web/*/http://www.artspace.org.au

Which must contain at least 50 versions of the site from 1998 until the
present day, _including_ a version for July 2003.

I guess I am left wondering why is PANDORA necessary? I can see why
countries, states, cities, towns, and even clusters of houses might have
their own shared library of hard copy resources. This makes practical
sense, as sharing of these books can't happen unless you can get physical
access to them. Hence this kind of duplication of archive resources looks
like a good idea. However, digital archives don't have the same kind of
geographic limitations. If there is an archive in San Francisco, I can
access it in any part of Australia where there is a phone line and a
computer (and I hear Australia has a fair distribution of these ;) ... So
why the duplication of archive.org's service? Or perhaps more accurately,
why the _partial_ duplication of archive.org's service?

Does PANDORA know something about archive.org's technical service for
example? Is PANDORA sure it offers a better technical service than
archive.org, or is there more to it than initially meets the browser?

Kind regards,

adam

~/.nz

_______________________________________________
empyre forum
empyre@lists.cofa.unsw.edu.au
http://www.subtle.net/empyre